Mixture of Latent Trait Analyzers for Model- Based 
Clustering of Categorical Data 

Isabella Gollini^ & Thomas Brendan Murphy* 

^National Centre for Geocomputation, National University of Ireland Maynooth, Ireland 
*Scliool of Mathematical Sciences, University College Dublin, Ireland 

Abstract 

Model-based clustering methods for continuous data are well established and commonly used 
in a wide range of applications. However, model-based clustering methods for categorical 
data are less standard. Latent class analysis is a commonly used method for model-based 
clustering of binary data and/or categorical data, but due to an assumed local independence 
structure there may not be a correspondence between the estimated latent classes and groups 
in the population of interest. The mixture of latent trait analyzers model extends latent 
class analysis by assuming a model for the categorical response variables that depends on 
both a categorical latent class and a continuous latent trait variable; the discrete latent class 
accommodates group structure and the continuous latent trait accommodates dependence 
within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult 
because the likelihood function involves an integral that cannot be evaluated analytically. We 
develop a variational approach for fitting the mixture of latent trait models and this provides 
an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated 
on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the 
U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much 
better fit than either latent class analysis or latent trait analysis alone. 

1 Introduction 

Model-based clustering methods are widely used because they offer a coherent strategy for clus- 
tering data where uncertainties can be appropriately quantified using probabilities. Many model- 



based clustering methods are based on the finite Gaussian mixture model (eg. Celeux and Govaert 



1995 Fraley and Raftery 2002 McNicholas and Murphy 2008 ) and these methods have been suc- 



cessfully applied to the clustering of multivariate continuous measurement data; more recent ex- 
tensions of model-based clustering for continuous data use finite mixtures of non-Gaussian models 



(eg. |Lin et al.| |2007[ |Karhs and Santourian[ |2008t \Lin\ |2010t [Andrews and McNicholast |2012[ ). 

Categorical data arise in a wide range of applications including the social sciences, health 
sciences and marketing. Latent class analysis and latent trait analysis are two commonly used 



latent variable models for categorical data (eg. Bartholomew et al. , 2011). In the latent class 
analysis model, the dependence in the data is explained by a categorical latent variable that 
identifies groups (or classes) within which the response variables are independent (also known as 
the local independence assumption). Latent class analysis is used widely for model-based clustering 
of categorical data, however, if the condition of independence within the groups is violated, then 
the latent class model will suggest more than the true number of groups and so the results are 
difficult to interpret and can be potentially misleading. The latent trait analysis model uses a 
continuous univariate or multivariate latent variable, called a latent trait, to model the dependence 
in categorical response variables. If data come from a heterogeneous source or multiple groups, 
then a single multi-dimensional latent trait may not be sufficient to model the data. For these 
reasons, the proposed mixture of latent trait analyzers (MLTA) model is developed for model-based 
clustering of categorical data where a categorical latent variable identifies groups within the data 
and a latent trait is being used to accommodate the dependence between outcome variables within 
each cluster. 

In this paper, we focus on the particular case of binary data and we propose two different 
mixture of latent trait analyzers models for model-based clustering of binary data: a general 
model that supposes that the latent trait has a different effect in each group and the other more 
restrictive, but more parsimonious, model that supposes that the latent trait has the same effect 
in all groups. Thus, the proposed family of models (MLTA) is a categorical analogue of some 



parsimonious model families used for clustering continuous data (eg. Fraley and Raftery, 2002 



McNicholas and Murphyl |2008[ [Andrews and McNichoias[ [2012| ). The MLTA family of models is 



most like the PGMM family (McNicholas and Murphy, 2008) which is based on the mixture of 



factor analyzers model (Ghahramani and Hinton, 1997) however the MLTA model accommodates 
binary response variables instead of continuous variables. In addition, the model can be easily 
applied to nominal categorical data by coding the data in binary form. The MLTA model is a 



generalization of the mixture of Rasch models (Rost, 1990), and is a special case of multilevel 
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mixture item response models (Vermunt, 2007); these connections are explored in more detail in 



Section 3^, after the MLTA model has been fully introduced. 

Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood 
function involves an integral that cannot be evaluated analytically. However, we propose using 



a variational approximation of the likelihood as proposed by Tipping (1999) for model fitting 
purposes because it is extremely easy to implement and it converges quickly, so it can easily deal 
with high dimensional data and high dimensional latent traits. 

The mixture of latent trait analyzers model is demonstrated through the analysis of two data 



sets: the National Long Term Care Survey (NLTCS) data set (Erosheva et al. , 2007) and a U.S. 
Congressional Voting data set (Asuncion and Newman, 2007). The latent class analysis and the 
latent trait analysis models are not sufficient to summarize these two data sets. However, the 
mixture of latent trait analyzers model detects the presence of several classes within these data 
and the dependence structure within groups is explained by a latent trait. 

The paper is organized as follows. Section [2T provides an introduction to latent class analysis. 
Section 2J2_ provides an introduction to latent trait analysis with a description of three of the 
most common techniques to evaluate the likelihood in this model: the Gauss-Hermite quadrature 



approach (Section 2.2.1), Monte Carlo sampling (Section 2.2.2), and the variational approach 
(Section 2.2.3[ ). In Section [3] we introduce the mixture of latent trait analyzers model and a 
parsimonious version of the model (Section 3.1). An interpretation of the model parameters is 



outlined in Section 3.2, a discussion of models related to the MLTA model is given in Section 3.3 



and a variational approach for estimating the model parameters is proposed in Section |3.4[ The 
issue of model identifiability is discussed in Section 3^ In Section |4] an adjustment to the BIC 
(Schwarz, 1978) is proposed as a model selection criterion and Pearson's test and the truncated 



sum of squares Pearson residual criterion (Erosheva et al. , 2007) are used to assess model fit. 
Computational aspects of fitting the MLTA model are outlined in Section [5j The two applications 
are presented in Section QA_ and Section Q2_ We conclude in Section [7] by discussing the model 
and the results of its application. 



2 Latent Class & Trait Analysis 

In this section, we give an overview of latent class and latent trait analysis which are the basis of 
the mixture of latent trait analyzers model. We also outline how the models are fitted in practice. 
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2.1 Latent Class Analysis 

The latent class analysis (LCA) can be used to model a set of observations Xi, . . . ,Xiv each 
involving M binary (categorical) variables. Each observation x„ = (x„i,x„2, • • • iXnu)-, where Xnm 
records the value of binary variable m for observation n where Xnm ^ {0, 1}. 

LCA assumes that there is a latent categorical variable z„, = [zni-, Zn2i ■ ■ ■ ■, Zno) for which 
Zng = 1 if observation n belongs to class g and Zng = otherwise. We assume that z„ ~ 
Multinomial(l, (r^i, 772; • • • ; ^70))) where rjg is the prior probability that a randomly chosen indi- 
vidual is in the gth. class (X]^=i Vg = ^ r/g > for all g = 1, . . . ,G). Furthermore, conditional 
on the latent class variable Zng = 1, the response variables x„ = . . . ,XnM) are distributed as 
independent Bernoulli random variables with parameters VTgi, 71^2, • • • , vt^m; therefore, TCgm is the 
conditional probability of observing Xnm = 1 if the observation is from class g (that is, Zng = 1). 
Hence, the dependence between variables is explained by the latent class variable. LCA can be 
seen as a finite mixture model with mixing proportions rjg in which the component distributions 
are multivariate independent Bernoulli distributions. Consequently the log-likelihood function is 
given as, 

N / G M \ 

n=l \g=l m=l / 



Bartholomew et al. (2011) describe how to estimate the parameters by maximum likelihood esti- 



mation via the expectation- maximisation (EM) algorithm (Dempster et al. 1977). 

When LCA is used in a model-based clustering context, the inferred latent classes are assumed 
to correspond to clusters in the data; this may be a valid conclusion when the model assumptions 
are correct but may not be otherwise. 

2.2 Latent Trait Analysis 

Latent trait analysis (LTA) can also be used to model a set of A^ multivariate binary (categorical) 
observations. LTA assumes that there is a D dimensional continuous latent variable y„ {n = 
1, . . . , A^) underlying the behavior of the M categorical response variables within each observation 



x„ (Bartholomew et al. , 2011). The LTA model assumes that 

p(x„) = J p(x„|y„)p(y„)rfy„ (2) 
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where the conditional distribution of x„ given y„ is 

M M 

P(Xn|yn) = n ^ {^nmbn) = JJ Myn)T""' (1 " Vr„ (y„) ) , (3) 
m=l m=l 

and the response function is a logistic function 

TTmiyn) = P{Xnm = Myn) = — TTT" , r ^' < 7r„(y„) < 1 

1 + exp[-(6^ + w4y„)J 

where bm is the intercept parameter and are the slope parameters in the logistic function. 
Thus, the conditional probability that Xnm = 1 given y„ is an increasing function of bm + w^y„ 
and if w^, = we get a probability equal to 1/(1 + exp(— 6^)) independently of the value of y„. 
Finally, it is assumed that y„ ~ ?\f(0,I). Thus, although the variables within x„ are conditionally 
independent given y„, the marginal distribution of x„ accommodates dependence between the 
variables. 

Therefore, the log-likelihood is 

^ = log ( / P(x„|y„)p(y„) dynj=^hgl Yl Pi^nm\yn)p{yn) djn j (4) 

n=l ^"^ ^ n=l m=l / 

The integral on Q cannot be evaluated analytically, so it is necessary to use other methods to 



evaluate it. These are briefly reviewed in Section 2.2.1 -Section 2.2.3 
2.2.1 Gauss-Hermite Quadrature 



Bock and Aitkin (1981) proposed a method to evaluate the integral in ^ by using the Gauss- 



Hermite quadrature (Abramowitz and Stegun, 1964): 



N / Q \ N / Q M 

£ ^log I ^p(x„|yg)/i(yg) j = X^log I X] n Pi^nm\yq)h{yg 

n=l \q=l / n=l \q=l m=l 

where Q is the number of set of integration points and h{yi), . . . , hiyq) (Xl^=i ^(jq) — 1) 
weights for the sets of points yi, . . . , yg. In practice this method treats the latent variables as dis- 
crete taking values yi, . . . , yg with probabilities /i(yi), . . . , /i(yQ). Thus, the model density can be 
approximated by a finite mixture model with Q component densities, where p(x„|yi), . . . ,p{'Xn\yQ) 



are the component distributions and /i(yi) . . . , hiyq) are fixed mixing proportions. Bartholomew 



et al. (2011) explain this approach in detail and outline its drawbacks. The number of component 
densities, Q, required increases exponentially in D, so the Gauss-Hermite quadrature can be hard 
to implement and can be quite slow. Furthermore when the parameters are very unequal their 
estimates can diverge to infinity. 
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2.2.2 Monte Carlo Sampling 



An alternative approach is the Monte Carlo method (Sammel et al. , 1997), that samples (/ 
1, . . . ,L) from the D-dimensional latent distribution, and approximates the log-likelihood as 



n=l \ 1=1 / n=l \ 1=1 m=l 



\yi) 



This approximation of the model density can be seen as a finite mixture model with component 
densities p(x„|yi), . . . ,p(x„|y/,) and the mixing proportions are equal to ^, but usually L is quite 
large, making the implementation of this method quite difficult. 



2.2.3 Variational Approximation 



Tipping (1999) proposed to use a variational approximation (Jaakkola and Jordan, 1996) to fit the 



latent trait analysis model in a manner that is fast in convergence and easy to implement. The 
main aim of this approach is to maximize a lower bound that approximates the likelihood function. 
In latent trait analysis the log-likelihood function Q is governed by the logistic sigmoid function, 
that can be approximated by the exponential of a quadratic form involving variational parameters 
= {^ni, ■ ■ ■ , ^um) where ^nm 7^ for all m = 1, ... , M. This allows for the computation of 
an approximate log-likelihood in closed form. In this case the lower bound of each term in the 
log-likelihood is given by, 

^iin) = log(p(x„|^„)) 
/ M 

= log / JJ piXnm\yn, Cnm)p(yn) djr^ 



m=l 



where 



O-(^nm) exp 



O-(Cnm) = (1 + exp(-^„„)) \ Anm = {^^nm " 1) (&m + W^y„) and A(^„„) = {\ - O^inm)) I'^inra- 

This approximation has the property that p {xnm\yn-, ^nm) < P {xnm\yn) with equality when \^nm\ = 
Anm and thus it follows p (x„|^„) < p (x„). Tipping (1999) and [Bishop (2006) outline a variational 
EM algorithm that maximizes this lower bound and thus fit a latent trait analysis model. 

However, the approximation of the log-likelihood obtained by using the variational approach is 
always less or equal than the true log-likelihood, so it may be advantageous to get a more accu- 
rate estimate the log-likelihood at the last step of the algorithm using Gauss-Hermite quadrature 



(Section 2.2.1); this is discussed further in Sections 
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2.2.4 Comparison of estimation methods 

There are some advantages when using the variational approach to approximate the integral in 
the LTA likelihood. Firstly, the method involved avoids the need for user specified quantities like 
the number of quadrature points or the number of Monte Carlo samples. In addition, the variational 
approach involves iterating a series of closed-form updates that are easy to compute. Secondly, 
the algorithm usually converges considerably more quickly than in Gauss-Hermite quadrature 
and Monte Carlo sampling case, particularly so for large data sets and the approximation of the 
likelihood is more accurate as dimensionality increases, as the likelihood becomes more Gaussian 
in form. But, since the variational approach estimates the true likelihood function by using a 
lower bound, the approximation of the likelihood given by this method is always less or equal than 
the true value, so for low dimensional latent traits it may be better to use the Gauss-Hermite 
quadrature. Additionally, there are also the usual drawbacks of the EM algorithm: the results can 
vary because of the initialization of this algorithm and there is the risk of converging to a local 
maximum instead of the global maximum of the lower bound of the log-likelihood. 

In Figure [T] we compare the response function estimated by the Gauss-Hermite quadrature with 
6 quadrature points and the variational approach for a unidimensional latent trait model. The 
data consists of M = 16 binary variables and = 21574 observations, and are fully described in 



Section 6A_ In this example the two methods bring to very similar results, as shown by the plot 
of the two response functions, and their absolute differences that are always < 0.1. 



Furthermore, Tipping (1999) investigates the accuracy of the variational approach in a high 
dimensional context comparing the estimates given by the variational and the Monte Carlo sample 
approaches in terms of both computing time and errors. 



3 Mixture of Latent Trait Analyzers 

The mixture of latent trait analyzers (MLTA) model generalizes the latent class analysis and 
latent trait analysis by assuming that a set of A^ observations Xi, . . . ,X7v comes from G different 
groups, and the behavior of the M categorical response variables given by each observation x„ (n = 
1, . . . , A^) depends on both the group and the D dimensional continuous latent variable y„. Thus, 
the MLTA model is a mixture model for binary data but where observations are not necessarily 
conditionally independent given the group memberships. In fact, the observations within groups 
are modeled using a latent trait analysis model and thus dependence is accommodated. 

Suppose that each observation comes from one of G groups and we have z„ = {zni, Zn2, • • • , Znc) 
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Figure 1: Response function for the Gauss- Hermite Quadrature and Variational Approach. The 
absolute difference of the response functions is also shown. 
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which is an indicator of the group membership. We assume that z„ ~ Multinomial(l, (?7i, 772, • • • , Vg)) 
where rjg is the prior probabihty of a randomly chosen observation coming from the gth group 
iYl^'=i Vg' = 1 ^ V (? = 1, . . . , G). Further, we assume that the conditional distribution 

of x„ given that the observation is from group g (ie. Zng = 1) is a latent trait analysis model with 
parameters bmg and w^.^; this yields the mixture of latent trait analyzers (MLTA) model. 
Thus, the MLTA model is of the form, 

G G J. 

p(x„) = ^rigp{^n\Zng = 1) = ^ ?7g / p{^n\yn,Zng = 1) p(y„) 
9=1 9=1 

where the conditional distribution given y„ and Zng = 1 is 

M 



P {p^n\yn^ Zfig 1) P {^•^nm\y ni Zng 1) 

m=l 
M 



(5) 



m=l 

and the response function for each group is given by 



7^mg{yn) =P{Xnni = Myn,Zng = l) = 7^ TT, (6) 

1 + exp [-(^mg + w^^y. 



where b^g and w^g are the model parameters. In addition, it is assumed that the latent variable 
y„~X(0,I). 

The log-likelihood can be written as 

N / G ^ M \ 

Z^9 / n ^ (XnmWn, Zng = 1) piju) djn ] , (7) 

SO, the model is a finite mixture model in which the component distributions are latent trait 
analysis models and the mixing proportions are r^i, 772, . . . , ?7g- 



3.1 Parsimonious Model 



It is sometimes useful to use a more parsimonious response function than the one presented in (|6|; 
this is especially important when the data set is high dimensional, comes from several different 
groups and the continuous latent variable is high dimensional. 



Similarly to the factor analysis setting (Bartholomew et al. , 2011 ). 



Dx{D-l) 



parameters are con- 



strained because of the indeterminacy arising from all the possible rotations of = (w 



Ig, 



9/' 
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'"-''^"'- l + expl-fi,..,„ + wly„)l - 0^-M<h (8) 



SO the model in ^ involves {G — 1) + {G x M) + G x (M x D — D x {D — 1) /2) free parameters, 
of which G X (M x D — D x [D — l)/2) are the parameters Wi, W2, . . . , wg. 

If the Wg parameters are constrained to be the same in each group, a more parsimonious model 
would be: 

1 

exp [-{h^g + ^ljn)] 

that supposes that the w^g = for each group, so that it involves (G — 1) + (G x M) + 
(M X D — D X {D — I) /2) free parameters. 

3.2 Interpretation of Model Parameters 

In the mixture of latent trait analyzers model rjg is the mixing proportion for the group g {g = 
1, . . . ,G), that corresponds to the prior probability that a randomly chosen individual is in the 
g-th group. The behavior of the individuals within the group g is characterized by the parameters 
bmg and Wmg- In particular bmg has a direct effect on the probability of a positive response to the 
variable m given by an individual in group g, through the relationship 

T^mgW = p{Xnm = l|yn = 0, Z^g = 1) = — ^737 ^- (9) 

i + exp(^ Omg) 

The value 7rmg(0) is the probability that the median individual in group g has a positive response 
for the variable m, since the continuous latent variable is distributed as a 3\f(0,I). A measure 
of the heterogeneity of the values of variable m within group g is given by the slope value w^^; 
the larger the value of w^g the greater the differences in the probabilities of positive response in 
the variable m for observations from group g. The value of the slope parameters also account for 
the dependence between observed data variables. For example, if two variables m and m' yield a 
positive (negative) value for w^^w^/g then they will both simultaneously have a probability of a 
positive outcome greater (lesser) than the median probability for group g more often than expected 
under local independence. 

The quantity Wdmg can be used to calculate the correlation coefficient, within each group g, 
between the observed variables and the latent variable (ci = 1, . . . , -D) which is given by its 



standardized value (Bartholomew et al. , 2002) 



^ .,,2 



W 



d'mg 



Another useful quantity for analyzing the dependence within groups is a version of lift (Brin 



et al. , 1997). We use lift within each group to quantify the effect of the dependence on the 
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probability of two positive responses compared to the probability of two positive responses under 
an independence model. The lift is defined as, 



'11] 





Zng — 


1) 




Zng = 1) P {Xnk = 1 


^ng 1) 



where m = 1,2, ... ,M and m ^ k. Two independent positive responses have lift = 1: the more 
the variables are dependent, the further the value of the lift is from 1. It is possible to estimate it 
by using the Gauss-Hermite quadrature as. 



hft 



E 



Q 

7=1 "mg 



iyq)T^kg{yq)h{yg 



(12) 



J2q=l T^mg(jg)hiyg)^ {T,q=l '^kg{yq)h{yg 

where /i(yi), . . . , ^(yg) are the appropriate weights associated with the quadrature set of points 



yi, . . . , yg and the parameters are estimated using the variational method (Section 3.4). Lift values 
that are much less than 1 are evidence of negative dependence within groups and lift values that 
are greater than 1 are evidence of positive dependence within groups. 

Finally, the posterior distribution of the latent variable y„ conditional on the observation 



belonging to a particular group can be obtained from the model outputs (see Section 3.4) and the 
posterior mean estimates of these y„ scores can be used to interpret the latent variables within 
each group. 



3.3 Related Models 

The MLTA model has a lot of common characteristics with a number of models in the statistical 
and wider scientific literature. 

The MLTA model can be seen as a discrete response variable analogue of the continuous re- 



sponse variable mixture of factor analyzers (MFA) model (Ghahramani and Hinton, 1997 McLach 



Ian et al. 2003) and more recently developed parsimonious analogues of this model (McNicholas 



, 2008 


2010 


Baek et al. , 


2010) 



counts for grouping structure and a latent factor score accounts for correlation between response 
variables within groups. The component means in the MFA model are analogous to the intercepts 
in the MLTA model, the loading matrix is analogous to the slope parameters, the mixing propor- 



tions take an identical role in both models. Additionally, Muthen (2001) describes a general latent 
variable model for continuous outcome variables as used by the Mplus software and this model 
also uses both continuous and discrete latent variables in the model framework. 
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The Rasch model (Rasch 1960) is a commonly used model for analyzing binary (and categorical 
data) and it has been widely used in educational statistics. The Rasch model has a very similar 
structure to the LTA model. In the Rasch model, the probability of a positive outcome is given as 

where the /3„ values are called ability parameters and the 6m values are called difficulty parameters 
and these correspond to the latent trait and the intercept parameters in the LTA model. Further- 
more, the model makes a local independence assumption when constructing the probability 

M 

P(x„|/3„,(5) = \[nXnm\Pn,5m), 

m=l 

where S = [61,62, ■■ ■ ,6m)- In some formulations of the Rasch model, the ability parameters are 
treated as fixed unknown parameters and in other formulations they are treated as random effects 
thus making the model equivalent to the univariate LTA model. 

Consequently, the mixture of latent trait analyzers (MLTA) model has a similar structure to 



the mixture Rasch model (Rost, 1990; Rost and von Davier| [1995; von Davier and Yamamoto 



2007), where the response function is: 



(x =1\B 6 z = p - ^^P(/^" ^ 



1 + exp(/3„ - 6mg) 

However, as with the Rasch model, the mixture Rasch model usually includes a univariate ability 
parameter /3„ that is the same across all the variables and groups, that would be equivalent to 
assuming that MLTA model W\g = w^g = . . . = WMg and Wmg = Wm for all g. 



In von Davier et al. (2007) they extend the mixture Rasch model in order to allow multivariate 



exp(q^/3„ - 6mg) 
1 + exp(q^/3„ - 5„ 



ability parameters: 

■IP('^nm 1 1 Qm; /^Ti; '^m; •^ng 1) 

1 -r K^\J\^mh'n ~ ^mgj 

where are variable-specific parameters, and /3„ is the /^-dimensional ability parameter. This is 
equivalent to a MLTA model with Wmg = Wm for all g, as in our parsimonious model. 

Thus, many versions of the mixture Rasch model are included as special cases of the MLTA 
model. Again, the latent ability parameter is either assumed to be a fixed or random effect 
that varies across observations to accommodate different outcome probabilities; the latent ability 
parameter is the primary quantity of interest in many Rasch modeling applications. 
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Another model that has a very similar structure to the mixture Rasch model is the mixed latent 



trait (MLT) model described in Uebersax (1999), this model also has a univariate latent trait, uses 
a probit structure and numerical integration are required for computing the likelihood, so they can 



be difficult to apply to large heterogeneous data sets. Qu et al. (1996) and Hadgu and Qu (1998) 
use a two component model with a similar form to the MLT model in medical diagnosis. 



Additionally, Uebersax (1999) also describes a probit latent class model (PLCA) which has a 
similar structure to the MLTA model, but this model uses a multivariate latent trait which has the 



same dimensionality as outcome variable. Vermunt and Magidson (2005) developed a latent class 
factor analysis (LCFA) model which uses discrete latent variables to develop a model with similar 
modeling aims to latent trait analysis, but which can be fitted in a much more computationally 
efficient manner. 

Perhaps, the MLTA model proposed has closest connections to the family of models identified 



as multilevel mixture item response models (Vermunt, 2007). Key differences between our model 
and the multilevel mixture item response model are that we focus on a multivariate trait parameter 
(which is optional only in the model of Vermunt) and that we further also introduce a constrained 
parsimonious version of the model. In addition, we offer a computationally efficient alternative 
algorithm for fitting this model without the need to resort to numerical quadrature methods. The 
MLTA and multilevel mixture item response models can be estimated using the software Latent 



GOLD (Vermunt and Magidson, 2008), but the fact that this software uses quadrature methods 
for the numerical integration of continuous latent variables probably makes it less suitable for 
analyzing large datasets with underlying highdimensional latent trait structures. 



3.4 Model Fitting 

When fitting the MLTA model, the integral in the log-likelihood ([T]) cannot be solved analyt- 
ically since it is exactly the same as in latent trait analysis (|4]). To obtain the approximation of 
the likelihood it is possible to use an EM algorithm: 

1. E-step: Estimate Zng as the approximate posterior probability that an individual with re- 
sponse vector x„ belongs to group g. 

2. M-step: Estimate T]g as the proportion of observations in group g. 

Estimate the integral in the complete-data log-likelihood by using the variational approach. 
Estimate the parameters b^g and w^g for m = 1, 2, . . . , M and g = 1,2, ... ,G. 
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Also, estimate the approximate log-likelihood value. 
3. Return to ste2)[T] until convergence is attained. 
The advantages and the drawbacks of the different estimate methods for the likelihood are the 



same as in latent trait analysis (Section 2.2.4). 

We further focus on the variational approach to approximate the log-likelihood because it can 
be efficiently implemented. Similarly to the latent trait analysis case, we introduce the variational 
parameters = {C,nig, ■ ■ ■ ,^nMg) where ^nmg 7^ for all m = 1,...,M to approximate the 
logarithm of the component densities with a lower bound: 

^iing) = log {p{^n\Zng = 1, ^„g)) 

= log I / {Xnm\yn, Zng = 1, ^nmg) PiYn) C^Yn 

V"^ m=l 

where the conditional distribution p^Xmnlyyu, Zng = 1) is approximated by, 

P {Xnm\yn, ^ng = 1; ^nmg) = (^{^nmg) GXp ^ ^ h \{^nmg) {^nmg ~ ^nmg)^ ) 

where a{^nmg) = (1 + exp(-^„„g))"\ A^mg = C^Xnm - l)ibrng + W^^y^) and Xi^nrng) = H ^ 
i^^nmg)^ / '^^nmg ■ 

To obtain the approximation of the log-likelihood it is necessary to use a double EM algorithm 
on the {i + l)th iteration: 

1. E-step: Estimate Zn^^'' with: 



2. M-step: Estimate ril^~^^^ using: 



S^N (j+1) 
(j+1) _ 2^n=l ^"g 



N 

3. Estimate the likelihood: 



(a) E-step: Compute the latent posterior statistics for p (yn\^n, Znfl~^^ = which is a 

density: 



M "I ~1 



^ / ^ ^ ' VSnrraq/ mq mq 
m=l 



14 



and 



H'ng 



Q{i+1) 



-ng 



■ M 

E 

m=l 



2 ' '\^nmg/"mg I m 



) 



(b) M-step: Optimize the variational parameter C.nmg'' in order to make the approximation 



p{^n\Z: 



ng 



l^^ratT^'') close as possible to p(x„|z 



ng 



1) for all x„ using: 



(j+l)2 
nmg 



mg \^ng ' ^^ng H'ng J mg ~ "mg mg r^ng ~ "mi 



2 

mg 5 



smce p [x 



nmiYm Zng = 1, ^nmg) IS Symmetric in inmg choosing either the positive or the 



negative root of S,nmg yields to the same results. 



(c) Optimise the model parameters w 



mg and hmg 



in order to increase the approximate 



likelihood p ( xi, . . . , XAr|^i\^^\ . . . , ^J,^^^ 



usmg: 



w 



(i+i) 

mg 



N 



n=l 



where Wmg^-* 



w 



(i+l)T ,(i+l)NT -(j+1) 



mg 



" TV 

_n=l 
{i+l)T ^-^T 



-ng 



H'ng 



ng 



and 



E[yny„jg 

(d) Estimate the lower bound: 

M 



f^(i+l) I (j+1) («+l)T 
^ng ~v t-^ng t-^ng t-^ng 



m=l 



log 2 -^('Cnmg'*)^img^ "I" ^^n.m 2^ ^mg ^ '^('bnmg^)^mg 



+ 



iog|c^:g+^^i 



+ 



fJ'ng 



(i+l)Tr^(i+l)l_l,,(i+l) 



t^ng 



and the log-likelihood: 



N 



G 



£«^^log|5^r/fi)exp(£(4+i))) 
.9=1 



n=l 



Return to step [T] until convergence is attained. 

Estimate the log-likelihood by using the Gauss-Hermite quadrature: 



N 



G 



^log I ^77g^p(x,|yg 

5=1 9=1 



; ^ng 



my,) 



(13) 



n=l 



where /i(yi), . . . , h{yQ) are the appropriate weights associated to the quadrature set of points 
yi, . . . , yg and p(x„|yg, Zng = 1) is given by Q and Q. 
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To fit the parsimonious model outlined in Section 3A_ it is possible to use the variational 
approach using the EM algorithm described above, except for the estimate of the model parameters 



at step 3c 



where 



and, 



w 



(i+i) 



m \ m 5 "ml 5 • • • 5 "rriG / ' Im 



E 



N 



1 ^nG 



2^?i=l ^nG '^y'inmG )f^nG 



(i+1) wt(i+l)N,,(i+l)T 



Z_^n=l ^nl ^V'Snml / 






Z^n=l ^nG ^V?nmG /^*nG 





v^AT 







''nG 



Hi 



nmG ■ 



The derivation of these parameter estimates is given in Section 1 of the supplementary material. 



3.5 Model Identifiability 



Model identifiability is an important issue for a model involving many parameters. Goodman 



(1974), Bartholomew et al. (2011) and Dean and Raftery (2010) give a detailed explanation of 



model identifiability in latent class analysis and Bartholomew (1980) introduces this issue in the 



latent trait analysis context. In AUman et al. (2009) they argue that the classical definition of 



identifiability is too strong for a lot of latent variable models, and they introduce the concept of 
"generic identifiability" that implies that the set of points for which identfiability does not hold 
has measure zero. They explore "generic identifiability" for different models, including LCA. 

A necessary condition for model identifiability is that the number of the free estimated param- 
eters not exceed the number of possible data patterns. Nevertheless this condition is not sufficient 
as the actual information in a dataset can be less depending of the size or the frequency of pattern 
occurrences within the dataset. 
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As with the loading matrices in mixture models with a factor analytic structure (eg. Ghahra- 



mani and Hinton 1997 McNicholas and Murphy 2008) the slope parameters are only identifi- 



able up to a rotation of the factors. The rotational freedom of the factor scores is important when 



determining the number of parameters in the model (Section 3.1 and Section |4]). 

In addition, model identifiability holds if the observed information matrix is full rank. As a 
result possible non-identifiability can manifest itself through high standard errors of the parameter 
estimates. Another empirical method that can be used to assess non-identifiability consists of 
checking whether the same maximized likelihood value is reached with different estimates of the 
model parameter values when starting the EM algorithm from different values. 

We found that these checks for identifiability were all satisfied in the empirical examples dis- 
cussed in Section [6l 



4 Model Selection 



The Bayesian Information Criterion (BIG) (Schwarz (1978)) can be used to select the model 



BIG = -2i + klog{N), 

where k is the number of free parameters in the model and is the number of observations. The 
model with the lower value of BIG is preferable. It is important to remember that its value could 
be overestimated if the log-likelihood is approximated by using the variational approach; hence, 
the proposal to use Gauss Hermite quadrature to evaluate the maximized log-likelihood for model 
selection purposes. 

In the context of MLTA, the values of G, D and whether the values are constrained to be 
equal across groups need to be determined. 

In the mixture of latent trait analyzers context (and more widely within finite mixture models) 
the BIG penalizes too much the models with high D and/or high G. The BIG as defined by 



Schwarz (1978) implicitly assumes that all the parameter estimates depend on the entire set of 
A^ observations, but in MLTA the estimates of and depend just on the observations that 
belong to group g. So, an alternative penalized BIG is given as 

G 

BIG* = -2i +{G-1) \og{N) + k*Y^ ^ogiVgN) 

9=1 

G 

= BIG + r5^1og(r/,), 

9=1 
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where k* is the number of free parameters depending on each group of observations. This penalty 
penahzes parameters to a lesser extent than BIG because only the estimated number of observations 
involved in the estimation of each parameter is used in the penalty. This version of BIG has 



previously been proposed by Steele (2002) and similar criteria have been proposed by Pauler 



(1998) and Raftery et al. (2007). It is worth noting that BIG* will behave in a similar way to BIG 
for large sample sizes. Hence, it can be seen as a small sample adjustment to BIG. 

The Pearson's x^-test can be used to check the goodness of the model fit. The statistic is 
calculated as 



X 



(Observedp — Expected^)" 
Expected 



E 

p=i 

■y^^ (Observedn — Expected„)^ 



TV 



n=l 



Expected^ 



Expected^ 



n=l 



where P = 2*^ is the total number of possible patterns, N_ is the number of observed patterns 
and Observed„ and Expected^ = Np{'Xn) represent the observed and expected frequencies for the 
n-th response pattern, respectively. Under the null hypothesis the statistic is asymptotically 
distributed N ^ oo. If M is large it is common to have a large number of very small 

counts and the Pearson's x^-test is not applicable. In this case, Erosheva et al. (2007) suggest the 
truncated SSPR criterion to examine deviations between expected and observed frequencies via 
the sum of squared Pearson residual only for the patterns with large observed frequencies. 



5 Computational Aspects 

The estimates of the parameters in the latent class analysis models {D = 0) are the exact maxi- 
mum likelihood estimates. Since the dimensionality of the data in the two data sets is large, the 
parameters for the model with D > 1 are estimated by using the variational approach and the 
log-likelihood has been calculated at the last step of the algorithm by using the Gauss-Hermite 
quadrature, with 5 quadrature points per dimension; the results were not sensitive to the number 
of quadrature points but the computational time was very heavily dependent on this number. 

The categorical latent variables z„ {n = 1,...,N) have been initialized by randomly as- 
signing each observation to one of the G possible groups. The variational parameters ^nmg 
[n = 1, . . . ,N, m = 1, . . . , M, g = 1, . . . ,G) are initialized to be equal to 20, this implies the 
initial approximation of the conditional distribution to be very close to and it reduces the depen- 
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dence of the final estimates on the initializing values. The model parameters bmg and w^g have 
been initialized by random generated numbers from a 3\f(0, 1). Ten random starts of the algorithm 
were used and the solution with the maximum likelihood value was selected. 

The standard errors of the model parameter have been calculated using the jackknife method 



(Efron, 1981). It is worth noting that when employing this method the estimates of the parameter 
without the n-th observation {n = 1, . . . , N) can be obtained in just a few iterations. 

Since the EM algorithm is linearly convergent, a criterion based on the Aitken acceleration 



(McLachlan and Peel, 2000) has been used to determine the convergence of the algorithm. The 



EM algorithm has been stopped when 

where i is the iteration, tol = 10^^ is the desired tolerance, 

1 



-A 



and 



6 Applications 

6.1 National Long Term Care Survey (NLTCS) 



Erosheva (2002, 2003, 2004[ ) and Erosheva et al. ( 2007[ ) report on 16 binary outcome variables 



recorded for 21574 elderly (age 65 and above) people who took part in the National Long-Term Care 
Survey in the years 1982, 1984, 1989 and 1994. The outcome variables record the level of disability 
and can be divided in two subsets: the "activities of daily living" (ADLs) and "instrumental 
activities of daily living" (lADLs). The first subset is composed of the first six variables which 
include basic activities of hygiene and personal care: (1) eating, (2) getting in/out of bed, (3) 
getting around inside, (4) dressing, (5) bathing, (6) using the toilet. The second subset concern the 
basic activities necessary to reside in the community: (7) doing heavy house work, (8) doing light 
house work, (9) doing laundry, (10) cooking, (11) grocery shopping, (12) getting about outside, 
(13) travelling, (14) managing money, (15) taking medicine, (16) telephoning. The responses are 
coded as 1 = disabled and = able. 

The MLTA model is fitted to these data for D = 0,1,2,3 and G = 1, 2, . . . , 11; the D = case 
corresponds to the LCA model and the G = 1 & D > case corresponds to the LTA model. 

Table [T] records the log-likelihood evaluated for the parameter estimates found using the algo- 



rithm outlined in Section 2.2.3 Estimates of the log-likelihood and the lower bound are reported 
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to emphasize the importance of estimating the log-hkehhood at the last step of the variational 
approach by using the Gauss-Hermite quadrature instead of the lower bound values. 



Table 1: Approximated log-likelihood and lower bound (in parentheses) 







D = 


D 


= 1 


D 


= 2 


D 


= 3 


G = 


1 


-200085.10 


-140318.06 


(-145827.00) 


-136169.53 


(-144483.30) 


-136075.66 


(-144483.30) 


G = 


2 


-152527.30 


-135301.29 


(-139930.20) 


-134273.79 


(-139283.00) 


-134275.46 


(-139091.30) 


G = 


3 


-141277.10 


-134362.61 


(-136349.10) 


-133025.27 


(-136080.80) 


-133008.17 


(-136066.50) 


G = 


4 


-137464.20 


-133120.36 


(-134540.70) 


-131839.77 


(-134253.60) 


-132116.82 


(-134145.00) 


G = 


5 


-135216.20 


-131813.29 


(-133203.40) 


-131505.23 


(-133061.10) 


-131393.42 


(-132919.80) 


G = 


6 


-133643.80 


-131396.59 


(-132261.50) 


-131154.94 


(-132161.60) 


-130992.52 


(-132123.00) 


G = 


7 


-132659.70 


-131120.79 


(-131840.30) 


-130729.39 


(-131785.80) 


-130607.37 


(-131731.10) 


G = 


8 


-132202.90 


-130708.20 


(-131565.90) 


-130450.55 


(-131146.60) 


-130403.20 


(-131106.60) 


G = 


9 


-131367.70 


-130342.81 


(-130866.20) 


-130164.32 


(-130855.20) 


-130155.19 


(-130798.00) 


G = 


10 


-131155.90 


-130135.91 


(-130806.60) 


-130049.64 


(-130681.20) 


-129936.33 


(-130544.50) 


G = 


11 


-130922.60 


-130110.22 


(-130574.40) 


-129860.74 


(-130475.10) 


-129881.83 


(-130404.50) 



The estimated BIG and BIG* have been used to select the best model (Table [2]), both criteria 
indicate that the model with 10 groups and a one-dimensional latent variable as the best model. 



Fienberg et al. (2009) show that the LGA model that minimizes the BIG has nineteen latent classes, 
so the mixture of latent trait analyzers suggests that there are much fewer groups. The nineteen 
latent class model has a lower BIG (BIG=262165.07) but they argue that this number of classes 
is not sensible in the context of the application. This suggests that the LGA is using multiple 
groups to account for dependence and that the latent classes do not necessarily correspond to data 
clusters. 

Since there is a large number of response patterns with a very small number of observations 
(of all the 2^^ = 65536 possible response patterns, 62384 (95.2%) contain zero counts and only 
481 (0.7%) contain more than 5 counts), the Pearson's test is not applicable and the truncated 
SSPR criterion has been used to test the goodness of fit of the model. Three different levels of 
truncation are shown in Table IH 

The best model selected by the BIG and the BIG* is one of the best fits as indicated by 
the SSPR over all the three levels of truncation. Table 1 of the supplementary material shows 
a comparison of the observed and the expected frequencies for the response patterns with more 
than 100 observations under the best model selected. The table shows that there is a close match 
between the observed and expected frequencies under this model. 
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Table 2: The estimated BIG (left) and BIG* (right) for the models with D = 0,1,2,3 and G = 
1,2,. ..,11. 



15 = D = l D = 2 D = 3 D = D = l D = 2 D = 3 



G = 


1 


400329.84 


280955.46 


272808.09 


272760.05 


G = 


1 


400329.84 


280955.46 


272808.09 


272760.05 


G = 


2 


305383.99 


271251.24 


269495.61 


269778.37 


G = 


2 


305360.21 


271204.68 


269427.88 


269690.47 


G = 


3 


283053.38 


269703.18 


267477.56 


267862.51 


G = 


3 


282997.58 


269596.55 


267322.64 


267661.41 


G = 


4 


275597.01 


267548.00 


265585.58 


266698.51 


G = 


4 


275504.40 


267367.43 


265322.50 


266345.65 


G = 


5 


271270.65 


265263.18 


265395.51 


265870.43 


G = 


5 


271136.22 


264993.58 


265006.92 


265366.66 


G = 


6 


268295.46 


264759.09 


265173.93 


265687.34 


G = 


6 


268113.98 


264403.40 


264614.73 


264973.09 


G = 


7 


266496.97 


264536.80 


264801.82 


265535.74 


G = 


7 


266267.39 


264069.53 


264102.10 


264643.31 


G = 


8 


264882.64 


264040.94 


264723.15 


265746.12 


G = 


8 


264591.64 


263464.76 


263877.58 


264649.68 


G = 


9 


264252.28 


263639.47 


264629.69 


265868.82 


G = 


9 


263910.91 


262952.98 


263618.30 


264544.87 


G = 


10 


263998.39 


263554.99 


264879.33 


266049.81 


G = 


10 


263600.50 


262766.36 


263684.61 


264524.20 


G = 


11 


263701.27 


263832.93 


264980.55 


266559.53 


G = 


11 


263242.58 


262921.59 


263581.68 


264754.70 



From Table |2] and [3] it is also possible to see how the mixture of latent trait analyzers model is 
more appropriate than the latent class analysis and the latent trait analysis, in terms both of BIG 
and goodness of fit. 

Table 3: The sum of squared of Pearson residuals for different levels of truncation (> 100, > 25, 
> 10). 



observed frequencies > 100 observed frequencies > 25 observed frequencies > 10 







D = 1 


D = 2 


D^3 


D = 


D = 1 


D = 2 


£> = 3 


D = 


D = 1 


D = 2 


D = 3 




1 


6.1e+09 


2441 


1924 


4082 


6.3e+09 


20524 


21513 


25795 


6.3e+09 


38927 


25130 


30009 


G = 


2 


80199 


1600 


1318 


1718 


129642 


4614 


3590 


4822 


182265 


8126 


6614 


9654 


G = 


3 


5304 


1470 


1410 


1597 


16588 


4610 


2974 


3318 


61331 


7610 


4898 


5814 


G = 


4 


4875 


1439 


638 


806 


10762 


3060 


1874 


2530 


17405 


5498 


2993 


4609 


G = 


5 


2717 


598 


490 


545 


6321 


1788 


1498 


1683 


10886 


3414 


3037 


3255 


G = 


6 


1434 


323 


533 


462 


3901 


1423 


1787 


1530 


6652 


2498 


2758 


2817 


G = 


7 


561 


561 


561 


418 


2412 


2412 


2412 


1450 


4808 


4808 


4808 


2468 


G = 


8 


413 


413 


413 


234 


1837 


1837 


1837 


1228 


3792 


3792 


3792 


2115 


G^ 


9 


417 


253 


183 


229 


1647 


987 


938 


1054 


3010 


1725 


1534 


1761 


G = 


10 


348 


160 


292 


150 


1348 


723 


827 


946 


2695 


1367 


1495 


1558 


G = 


11 


280 


196 


171 


118 


1336 


872 


760 


700 


2246 


1518 


1348 


1481 



The parameter estimates and standard errors for the selected model are reported in Table |4} 
The standardized values and median probabilities vrmg(O) are also reported to aid interpretation 
of the groups found by this model. 
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Table 4: Parameter estimates {standard errors) for the selected model {G = 10, D = 1) 

g=l 9 = 2 S = 3 







b„ 


^.9 






w 


mg 








T^gCO) 




ij 


mg 








mg 






71 J, 


■Km 


g(0) 




b 


mg 






w 


mg 




ui 


Tig 


Tmg(O) 


m 




1 


-3 


16 


(0 







02 


(0 


0.<() 





02 


0.04 


-2 


89 


(0 


12) 





96 


(0 


12) 





69 





07 


-5 


68 


(0 


06) 





00 


(0 


00) 





00 







m 


= 


2 


-0 


97 


(0 


OS) 


1 


11 


(0 


06) 





74 


0.31 


-4 


53 


(0 


12) 





05 


(0 


06) 





05 





01 


-0 


26 


(0 


05) 





13 


(0 


04) 





13 





44 


m 


= 


3 





38 


(0 


0,9 ) 


1 


36 


(0 


05) 





81 


0.67 


-2 


81 


(0 


15) 


-0 


01 


(0 


07) 


-0 


01 





06 


2 


37 


(0 


07) 


-0 


08 


(0 


02) 


-0 


08 





91 


m 


= 


4 


-1 


46 


(0 


0,9 ) 





38 


(0 


07) 





36 


0.19 


-1 


90 


(0 


12) 


1 


50 


(0 


11) 





83 





20 


-2 


72 


(0 


OS) 





07 


(0 


02) 





07 





06 


m 




5 





64 


(0 


OS) 





64 


(0 


07) 





64 


0.65 


-0 


49 


(0 


10) 


1 


42 


(0 


09) 





82 





41 





38 


(0 


05) 





83 


(0 


02) 





64 





69 


m 




6 


-1 


19 


(0 


OS) 





73 


(0 


07) 





59 


0.25 


-2 


74 


(0 


12) 


1 


05 


(0 


12) 





72 





09 


-0 


92 


(0 


06) 





67 


(0 


04) 





56 





29 


m 




7 


5 


95 


(0 


08) 


-0 


05 


(0 


02) 


-0 


05 


1 


1 


78 


(0 


11) 


1 


70 


(0 


09) 





86 





77 


2 


51 


(0 


OS) 





11 


(0 


OS) 





11 





92 


m 




8 





69 


(0 


08) 


-0 


14 


(0 


07) 


-0 


14 


0.67 


-1 


54 


(0 


11) 


1 


77 


(0 


08) 





87 





26 


-3 


44 


(0 


07) 


-0 


01 


(0 


01) 


-0 


01 





03 


m 




9 


2 


84 


(0 


11) 


-0 


12 


(0 


05) 


-0 


12 


0.94 





61 


(0 


11) 


1 


80 


(0 


08) 





87 





60 


-0 


37 


(0 


05) 





23 


(0 


05) 





22 





41 


m 




10 


1 


54 


(0 


10) 





31 


(0 


07) 





30 


0.82 





69 


(0 


11) 


1 


73 


(0 


OS) 





87 





61 


-2 


21 


(0 


07) 





01 


(0 


02) 





01 





10 


m 




11 


3 


10 


(0 


Iti) 





29 


(0 


07) 





28 


0.96 


2 


30 


(0 


i;i) 





82 


(0 


12) 





63 





89 


1 


68 


(0 


07) 





05 


(.0 


o.-i) 





05 





84 


m 




12 


1 


91 


(0 


10) 


1 


00 


(0 


OS) 





71 


0.84 


-0 


28 


(0 


00) 





67 


(0 


09 ) 





56 





44 


4 


39 


(0 


06) 





00 


(0 


01) 





00 





99 


m 




13 


1 


62 


(0 


09 ) 





66 


(0 


OS) 





55 


0.82 


1 


72 


(0 


11) 





08 


(0 


10) 





08 





85 


1 


60 


(0 


07) 





00 


(0 


o;i) 





00 





83 


m 




14 





11 


(0 


0,9 ) 


1 


14 


(0 


06) 





75 


0.52 


2 


02 


(0 


14) 





25 


(0 


07) 





25 





88 


-1 


59 


(0 


06) 


-0 


03 


(0 


o:i) 


-0 


03 





17 


m 




15 


-0 


38 


(0 


OS) 





89 


(0 


06) 





67 


0.42 





52 


(0 


09) 





83 


(0 


09 ) 





64 





61 


-2 


29 


(0 


07) 





01 


(0 


02) 





01 





09 


m 




16 


-0 


98 


(0 


OS) 





93 


(0 


06) 





68 


0.30 





31 


(0 


09) 





47 


(0 


09) 





42 





57 


-3 


09 


(0 


OS) 


-0 


07 


(.0 


02) 


-0 


07 





04 
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6.1.1 Analysis of the Groups in the Selected Model 

From Table |4] it is possible to observe that Groups 7 and 8 consist mainly of the people that 
are able to do all the activities examined by the NLTCS survey. However, Group 8 has higher 
probability of being unable to do the activities. In particular, the probability of not being able 
to do heavy house work (activity 7) is much higher in Group 8 than in Group 7. In both groups, 
the estimated values are zero, indicating that the local independence assumption is adequate 
in these groups. The fact that is zero also indicates that there is no more variability in each 
outcome than can be described by a Bernoulli distribution. 

In contrast. Group 6 consists mainly of disabled people. In addition, the value is non- 
zero and the slope parameters are very negative for the activities of managing money (activity 
14), taking medicine (activity 15) and telephoning (activity 16). The slope parameters for these 
activities indicate that there is considerable variability in the outcomes in these activities and that 
the outcomes are positively dependent (that is, those able/disabled in one activity will tend to be 
able/disabled in the others). It is worth noting that these activities are the least physically taxing 
of the instrumental activities of daily living, as recorded in these data. 

Group 9 is characterized by people able to do the activities 1, 4, 8, 9, 10, 11, 13, 14, 15, 16 that 
correspond to eating, dressing, doing light house work, doing laundry, cooking, grocery shopping, 
travelling, managing money, taking medicine, telephoning, but these individuals are unable to get 
around inside and outside (activities 3 and 12). For the other activities (number 2, 5, 6, 7) there is 
quite a big variability in responses explained by the within group latent trait structure; the slope 
parameters for these activities are all negatively signed which again indicates that these activities 
exhibit strong positive dependence. These activities getting in/out of bed (activity 2), bathing 
(activity 5), using the toilet (activity 6) and doing heavy housework (activity 7) require mobility 
which may explain the positive dependence between the outcomes in these. 

Group 1 consists mainly of people able to eat (activity 1), but unable to do heavy house work 
(activity 7), for the other activities there is quite a big variability explained by the within group 
latent trait structure, but the median individual in this group has a probability in excess of 80% 
to be disabled in activities 9, 10, 11, 12, 13. Activities 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16 have 
highly positive slope parameters, whereas activities 7, 8, 9 have moderately negative slopes. Thus 
there is positive dependence within these sets of activities but negative dependence across these 
groups. 

Group 2 consists mainly of people able to get in/out of bed, and get around outside (activities 
2,3), but unable to travel (activity 13), for the other activities there is quite a big variability 
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explained by the within group latent trait structure, but the probability of the median individual 
of this group being disabled is large (over the 88%) for the activities 11, 14 (grocery shopping and 
managing money), and small (less than the 10%) for the activities 1 and 6 (eating and using the 
toilet). Interestingly, the median individual tends to have lower probability of being disabled in 
the activities of daily hving (ADLs) than in the instrumental activities of daily living (lADLs). 
The slopes for most activities have equal sign, so there is evidence of strong positive dependence 
between activities in this group. Thus people are either able or unable to do multiple activities in 
this group to a greater degree than independence would suggest. 

Group 3 consists mainly of people able to do the activities 1, 4, 8, 10, 14, 15, 16, but unable to 
do the activities 3, 7, 11, 12, 13, for the other activities there is quite a big variability explained by 
the within group latent trait structure. The activities with large slope parameters involve physical 
movement and these again exhibit positive dependence in a similar manner to Group 9. 

Group 4 is characterized mainly by people unable to do the activities of daily living 2, 3, 4, 5, 6, 
7, but there is large variability in the responses explained by the within group latent trait structure. 
In contrast to Group 2, for this group the median individual tends to have a higher probability of 
being disabled in the activities of daily living (ADLs) than in the instrumental activities of daily 
living (lADLs). The signs of the large magnitude slope parameters are equal, so there is strong 
positive dependence within this group also. 

Group 5 consists mainly of people able to do the activities 1, 2, 3, 4, 6, 8, 10, 15, 16, but 
unable to do heavy house work (activity 7), for the other activities there is quite a big variability 
explained by the within group latent trait structure. The large magnitude slope parameters are 
equal in sign, so there is strong positive dependence within these activities within this group also. 

Group 10 consists mainly of people able to telephone (activity 16), but unable to do heavy 
house work and get around inside and outside (activities 7, 3 and 12), for the other activities there 
is quite a big variability explained by the within group latent trait structure, but the probability of 
being disabled for the median individual of this group is large (over the 80%) for the variables 2, 5, 
11, 13, and quite small (between the 14% and the 21%) for the variables 1, 14, 15. In this group, 
activities 1, 2, 4, 5, 6, 14, 15 have strongly positive slope and activities 8, 9, 10, 11 have negative 
slopes. Thus, there is positive dependence within these activity groups and negative dependence 
across the activity groups. The second group of activities require a large amount of movement 
whereas the first group require less so. 

The groups found in this analysis suggest that there is a wide range of experiences in terms of 
activities of daily living in the individuals included in the NLTCS data. The groups have a similar 
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structure to the one found by Erosheva et al. (2007) but their model allows individuals to have 
mixed membership across groups, so they suggest that slightly fewer groups (between 7 and 9) are 
needed. 



6.2 U.S. Congressional Voting 

The voting records of the U.S. House of Representatives Congressmen in the second session of the 



98th congress (January 23rd, 1984-October 12th, 1984) are recorded in Congressional Quarterly 



Almanac (1984). During this session the House of Representatives recorded votes on 408 issues. 



Sixteen of the key issues voted on by the House are described in this publication (Congressional 



Quarterly Almanac, 1984, Pages 7-C to 11-C). The voting data for these sixteen issues was col- 



lated in Schlimmer (1987) and is publicly available in the UCI Machine Learning data repository 



(Asuncion and Newman, 2007). 



The data set contains the votes of all 435 House of Representatives congressmen on the sixteen 
issues: each vote has a value of yes (voted for, paired for, and announced for), no (voted against, 
paired against, and announced against) or undecided (voted present, voted present to avoid conflict 
of interest, and did not vote or otherwise make a position known). It is also known if the voter is 
a Democrat or a Republican congressman. The issues voted on were: (Ql) Handicapped Infants, 
(Q2) Water Project Cost-Sharing, (Q3) Adoption of the Budget Resolution, (Q4) Physician Fee 
Freeze, ( Q5) El Salvador Aid, ( Q6) Religious Groups in Schools, ( Q7) Anti-Satellite Test Ban, (Q8) 
Aid to Nicaraguan 'Contras', (Q9) MX Missile, (QIO) Immigration, (Qll) Synfuels Corporation 
Cutback, (Q12) Education Spending, (Q13) Superfund Right to Sue, (Q14) Crime, (Q15) Duty- 
Free Exports, (Q16) Export Administration Act /South Africa. 

Each question has been coded using two binary variables a and h: the responses for the a 
variables are coded as 1 = yes / no and = undecided , and h variable are 1 = yes, = no/ undecided. 
The MLTA model was fitted to these data for D = 0,1,2,3 and G = 1, 2, . . . , 5. The log-likelihood 
for the fitted models are reported in Table |5j 

The BIC and the BIC* (Table [6] and [T]) agree in selecting the model with 4 groups, a bivariate 
latent trait and with slope parameters, w, equal across groups as the best model. 



6.2.1 Analysis of the Groups in the Selected Model 



Table [8] shows a cross tabulation of group membership with party membership (Republican or 
Democrat) for the 435 congressmen. From this table it is clear that Group 4 consists mainly of 
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Table 5: Approximated log-likelihood for the models with D = 0,1,2,3 and G = 1, 2, . . . , 5 
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Table 6: The estimated BIC for the models with D = 0,1,2,3 and G = 1, 2, . . . , 5 
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Republican congressmen, Group 1 and 3 consist mainly of Democrat congressmen and Group 2 
is a small group consisting of members of both parties. Interestingly, even though the Democrat 
congressmen are divided into two voting blocs, these do not correspond to the Northern Democrats 
and Southern Democrats which were perceived to be voting blocs within the party (Table |9|. 
However, of the eleven Democrat congressmen who are in Group 4, nine are Southern Democrats, 
suggesting that they vote in a similar way to the Republican congressmen. It is worth emphasizing 
that the data under consideration only records voting on sixteen key issues and that more voting 
blocs may be revealed if voting on all issues were examined. 



Table 10 gives the parameter estimate bmg and median probability TTmg{0) for each of the 
groups. This table reveals that Group 2 consists of congressmen who have a high probability of 
being undecided on many of the issues. 

The probabilities of a positive response for the a variables for the median individuals in Groups 
1, 3 and 4 are always very large (over the 0.90) with two exceptions in Group 1, for the variable 
number 3 where 7r3i(0) = 0.82 and for variable number 31 where '7r3ii(0) = 0. So, the majority 
of congressmen in the Groups 1, 3 and 4 voted on most issues, but with a slightly lower voting 
rate in Group 1 on the Water Project Cost-Sharing and a very low voting rate on the Export 
Administration Act /South Africa. As a result of these high voting rates, most of the probabilities 
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Table 7: The estimated BIG* for the models with D = 0,1,2,3 and G = 1, 2, . . . , 5 







D = 


D 


= 1 


D 


= 2 


D = 


3 










w fixed 




w fixed 




w fixed 


G = 


1 


12413.64 


9967.02 




9708.10 




9696.32 




G = 


2 


10127.64 


9761.98 


10028.51 


9746.68 


9718.82 


9986.49 


9681.51 


G = 


3 


9883.15 


9614.18 


9801.00 


9871.50 


9698.52 


10204.99 


9729.45 


G = 


4 


9834.05 


9600.57 


9666.16 


10017.47 


9464.28 


10404.27 


9920.84 


G = 


5 


9787.05 


9711.72 


9615.51 


10128.07 


9572.70 


10531.05 


9741.57 



Table 8: Classification table for party in the selected model {G = A, D = 2, w fixed) 





5 = 1 


5 = 2 


5 = 3 


5 = 4 


Democrat 


65 


8 


183 


11 


Republican 


2 


3 


18 


145 



given for b variables {yes versus no/undecided) in these groups can be interpreted as a voting yes 
versus no probability. 

From Table 10 and Figure [2] it is possible to observe that the responses for the median individual 
in the mainly Republican group (Group 4) are the opposite to the ones given by the median 
individuals in the mainly Democrat groups (Groups 1 and 3) for the majority of the issues. 

In fact the median individual in the Republican group (Group 4) has high probabilities to give 
positive responses for the variables 46, 56, 6b, 12b, 13b, 146, which are concerned with the Physician 
Fee Freeze, the El Salvador Aid, the Religious Groups in Schools, the Education Spending, the Su- 
perfund Right to Sue and the Grime; the Democrat groups (Groups 1 and 3) have low probabilities 
of giving these issues a positive response. Additionally, the Democrat groups have high probability 
of positive responses for the variables 36, 76, 86, 96, which are concerned with the Adoption of the 
Budget Resolution, the Anti-Satellite Test Ban, the Aid to the Nicaraguan 'Gontras' and the MX 
Missile, whereas the Republican groups have low probabilities for these issues. 

The choice of a model with the w common in all groups suggests the latent trait has the 
same effect in all groups. The estimated posterior mean of the latent variable y„ conditional on 
Zng = 1 was estimated for each voter and for each group and these are shown in Figures 1-4 of 
the supplementary material. The posterior mean is plotted a number of times in these figures to 
show the impact of the latent variable on outcome variables. The variation in the positions of 
the congressmen in these plots shows how the voting is influenced by the latent trait in addition 
to the baseline median probability of voting as characterized by the intercept parameters b. It is 
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Table 9: Classification table for location and party in the selected model 





5 = 1 


9 = 2 


9 = 3 


g = 4 


Northern Democrat 


48 


7 


121 


2 


Southern Democrat 


17 


1 


62 


9 


Northern Republican 


2 


2 


13 


111 


Southern Republican 





1 


5 


34 



clear from the plots that the latent trait can introduce significant variation in the outcome variable 
within a group if the slope parameter Wg has large magnitude. 

Prom the analysis of the lift (Tables 2-5 of the supplementary material) it is possible to observe 
that: within Group 1 and Group 3 there is a strong positive dependence within the set of variables 
(46, 56, 66, 126, 136, 146); this is shown by lift values in excess of one. Within Group 1 there is also 
a negative dependence between this set of variables and the set (16a, 166), within which there is 
a positive dependence; this is evidenced by the lift values well below one. Within Group 4 there 
is a positive dependence within the set of variables (36, 76, 86, 96, 156), and a negative dependence 
between this set of variables and the variable 26. In Group 2 there is a strong positive dependence 
within the sets of variables (46, 56, 66, 126, 136, 146), (26, 46, 56, 66) and (156, 16a, 166), and a neg- 
ative dependence between the sets of variables (26, 46, 56, 66) and (36, 76, 86, 96, 156, 16a, 166), and 
between the sets variables (126, 136, 146) and (156, 16a, 166). These lift values show that the model 
is accounting for dependence within the groups which may be missed in a latent class analysis of 
these data. 

7 Conclusions 

The mixture of latent trait analyzers model provides a powerful methodology for model-based 
clustering for categorical data. It provides a suitable alternative to LCA for clustering binary data 
and it can be fitted in a computationally efficient manner using a variational EM algorithm. In 
addition, the model includes LCA as a special case so it is also applicable when the LCA modeling 
assumptions are appropriate. The continuous latent trait in model also facilitates investigating 
the dependence between variables, thus offering a combination of the properties offered by the 
Latent Class Analysis model and the Latent Trait Analysis model. The model parameters are 
interpretable and provide a characterization of the within cluster structure. 

The variational EM algorithm proposed for model fitting gives an efficient mechanism for model 
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Figure 2: The median probabilities, 7r.mg(0), for the selected model (G = 4, D = 2, w fixed). 
The variable coded as a records the congress person voting yes/no versus undecided and the vote 
coded as b records the congress person voting yes versus no/undecided. In both cases the median 
probability gives the probability of the first outcome. 
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fitting, thus making the method easily apphcable. 

The model provides an extra model-based clustering framework for categorical data with a 



similar structure to the continuous data methods of mclust (Fraley and Raftery, 2002), MIXMOD 



(iBiernacki et all 120061), EMMIX (iMcLachlan and Peell 119991), pgmm (iMcNicholas and Murphy 



2008) and t-EIG EN (Andrews and McNicholas, 2012). 



The MLTA framework as proposed for the automatic selection using BIG (and a variant) 
of the number of clusters (G) and the latent trait dimensionality (D) required to appropriately 
model dependence within clusters. Alternative model selection techniques that are not based on a 
penalized likelihood criterion could offer a more computationally efficient manner for selecting G 
and D. 

The excellent clustering behavior of this method has been shown by two applications on the 
National Long Term Care Survey (NLTCS) and U.S. Congressional Voting data sets. In both 
cases, the model found groups that were intuitive in their interpretation and fewer groups were 
found than the LCA model suggests. 
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1 Parameter Estimation in the Parsimonious Model 

It is necessary to use an EM algorithm to estimate the parameters in the parsimonious model 
presented in Section 3.1 of the main paper. The mathematical calculations required to derive the 
EM algorithm for the parsimonious model beyond those needed for the MLTA model are contained 
herein. 

We define X = (xi, . . x^v^, Y = (yi . , y^v), Z = (zn, . . . , zjvq^) and H = (^n, . . . , 
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bmli ■ ■ ■ ibmcY 1 7r 



/ _ i^ 
















EiLi^^ncAte 



nmG ) 



M-step 



= argmax(E [log (X|Y, Z, H))]) 

— — /IKmWm + 7 — U 

therefore, 



2 Additional Results 

2.1 National Long Term Care Survey (NLTCS) Example 

Table [1] presents the results mentioned in the example regarding the National Long Term Care 
Survey (NLTCS) (Section 6.1 of the main paper). It shows the comparison of the observed and 
the expected frequencies for the response patterns with more than 100 observations under the best 
model selected {G = 10, = 1). 



2 



Table 1: Expected frequencies for the response patterns with more than 100 observations for the 
selected model 



Response Pattern 


Observed 


Expected 


0000000000000000 


3853 


3851 


0000100000000000 


216 


225 


0000001000000000 


1107 


1028 


0000101000000000 


188 


228 


0000001000100000 


122 


120 


0000000000010000 


351 


351 


0010000000010000 


206 


141 


0000001000010000 


303 


312 


0010001000010000 


182 


168 


0000101000010000 


108 


98 


0010101000010000 


106 


128 


0000000000001000 


195 


200 


0000001000001000 


198 


224 


0000001000101000 


196 


162 


0000001000011000 


123 


110 


0000001000111000 


176 


176 


0010001000111000 


120 


98 


0000101000111000 


101 


111 


0111111111111000 


102 


64 


1111111111111010 


107 


88 


0111111111111110 


104 


125 


1111111111111110 


164 


218 


0111111111111111 


153 


170 


1111111111111111 


660 


506 



2.2 U.S. Congressional Voting Example 

This section reports some details of the results for the best model selected {G = A, D = 2, w 
fixed) for the U.S. Congressional Voting Example mentioned in the paper (Section 6.2). 

We show the plots (Figures [lIQ, mentioned in Section 6.2.1, of the estimated expectations 
fij^g of the latent variable y„. We have a plot for each group, and only the observations with the 
maximum a posteriori probability to belong to that group are shown. 

In the variables coded as a the green circles correspond to the congressmen voting yes /no 
and the yellow triangles to the people undecided. In the variables coded as b the green circles 



3 



correspond to the congressmen voting yes and the yellow triangles to the people no / undecided. 

In the next pages are also reported Tables [2]-[5] showing the estimates of the lift mentioned in 
the paper (end of Section 6.2.1). 
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